
    Biomedical Data Integration in Cancer Genomics

    Cancer is one of the leading causes of death in industrialized nations, and its incidence is steadily increasing due to population aging. Cancer constitutes a group of diseases characterized by unwanted cellular growth resulting from random genomic alterations and environmental exposure. Diverse genomic and epigenomic alterations, separately and jointly, regulate gene expression and stimulate and support neoplastic growth. More effective treatment, earlier and more accurate diagnosis, and improved management of cancer are important for public health and well-being. Technological improvements in data measurement, storage, and transport capability are transforming cancer research into a data-intensive field. The large increase in the quality and quantity of data available for the analysis and interpretation of experiments has made computational and statistical tools necessary. Data integration - the combination of different types of measurement data - is a valuable computational tool for cancer research because it improves the interpretability of data-driven analyses and can thereby provide novel prognostic markers and drug targets. I have developed two computational data integration tools for large-scale genomic data and a simulator framework for testing a specific type of data integration algorithm. The first method, CNAmet, enhances the interpretation of genomic analysis results by integrating three data levels: gene expression, copy-number alteration, and DNA methylation. The second method, GOPredict, uses a knowledge discovery approach to prioritize drugs for patient cohorts, thereby stratifying patients into potentially drug-sensitive subgroups. Using the simulator framework, we are able to compare the performance of integration algorithms that integrate gene copy-number data with gene expression data to find putative cancer genes. Our experimental results on simulated, cell-line, and primary tumor data indicate that well-performing integration algorithms for gene copy-number and expression data use and process genomic data appropriately. When these methods are applied to diffuse large B-cell lymphoma, integrative analysis of copy-number and expression data helps to uncover a gene with putative prognostic utility. Furthermore, analysis of glioblastoma brain cancer data with CNAmet suggests that a number of known cancer genes, including the epidermal growth factor receptor, are highly expressed due to co-occurring alterations in their promoter DNA methylation and copy number. Finally, integration of publicly available molecular and literature data with GOPredict suggests that treating patients with FGFR inhibitors in breast cancer and CDK inhibitors in ovarian cancer could support standard drug therapies. Collectively, the methods developed here and their application to varied molecular cancer data sets illustrate the benefits of data integration in cancer genomics.
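
    As a loose illustration of the kind of integration algorithm the simulator framework compares, the sketch below (not CNAmet itself, and not the published framework) scores one gene by how strongly copy-number gains coincide with elevated expression across samples; the function name and the toy data are hypothetical.

        # Hypothetical sketch: associate copy-number gains with expression for one gene.
        # This is not CNAmet; it only illustrates the copy-number/expression integration idea.
        import numpy as np
        from scipy import stats

        def score_gene(expression, gained):
            """Welch t-statistic of expression in copy-number-gained vs. neutral samples."""
            high = expression[gained]
            neutral = expression[~gained]
            if len(high) < 2 or len(neutral) < 2:
                return np.nan
            t_stat, _ = stats.ttest_ind(high, neutral, equal_var=False)
            return t_stat

        # Toy example: one gene measured in 8 samples.
        expr = np.array([5.1, 7.9, 8.3, 4.8, 5.0, 8.1, 4.9, 7.7])
        gain = np.array([False, True, True, False, False, True, False, True])
        print(score_gene(expr, gain))  # a large positive score suggests a dosage effect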

    Language-Agnostic Reproducible Data Analysis Using Literate Programming

    A modern biomedical research project can easily contain hundreds of analysis steps, and the lack of reproducibility of these analyses has been recognized as a severe issue. While thorough documentation enables reproducibility, the number of analysis programs used can be so large that in practice reproducibility cannot be easily achieved. Literate programming is an approach to presenting computer programs to human readers: the code is rearranged to follow the logic of the program, that logic is explained in a natural language, and the code executed by the computer is extracted from the literate source. As such, literate programming is an ideal formalism for systematizing analysis steps in biomedical research. We have developed the reproducible computing tool Lir (literate, reproducible computing), which enables a tool-agnostic approach to biomedical data analysis. We demonstrate the utility of Lir by applying it to a case study in which our aim was to investigate the role of endosomal trafficking regulators in the progression of breast cancer. In this analysis, a variety of tools were combined to interpret the available data: a relational database, standard command-line tools, and a statistical computing environment. The analysis revealed that the lipid-transport-related genes LAPTM4B and NDRG1 are co-amplified in breast cancer patients and identified genes potentially cooperating with LAPTM4B in breast cancer progression. Our case study demonstrates that with Lir, an array of tools can be combined in the same data analysis to improve efficiency, reproducibility, and ease of understanding. Lir is open-source software available at github.com/borisvassilev/lir.
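
    Lir's own source format is not shown in this abstract, so the sketch below illustrates the general literate-programming "tangle" step instead: extracting named code chunks from a literate document. The noweb-style chunk markers and the example document are assumptions, not Lir's actual syntax.

        # Generic "tangle" step of literate programming: extract executable chunks
        # from a literate source. The chunk markers ("<<name>>=" ... "@") follow
        # noweb-style conventions and are assumed here for illustration only.
        import re

        def tangle(literate_text):
            """Return a dict mapping chunk names to their code bodies."""
            chunks = {}
            name, body = None, []
            for line in literate_text.splitlines():
                start = re.match(r"\s*<<(.+)>>=\s*$", line)
                if start:                                  # a named code chunk begins
                    name, body = start.group(1), []
                elif line.strip() == "@" and name is not None:
                    chunks[name] = "\n".join(body)         # chunk ends; store it
                    name = None
                elif name is not None:
                    body.append(line)                      # inside a chunk: collect code
            return chunks

        doc = """Prose explaining the analysis in plain language.
        <<load-data>>=
        data = open("counts.tsv").read()
        @
        More prose describing the next step."""
        print(tangle(doc)["load-data"])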

    Työdata ja tekoäly : Toiminnan kehittämisen ja johtamisen tueksi (Work Data and Artificial Intelligence: Supporting the Development and Management of Operations)

    This report describes the phases of the ”Työdatan hiljaiset signaalit” (silent signals of work data) project, in which work data, "digital trace data", a new data source for working-life research, was brought into a usable form. Timestamped data accumulates continuously and automatically in the logs of the systems used at work. Work data is examined from the perspective of the shared activity of the whole organization. The project's work data consists of the usage events recorded in the Moodle logs of one university of applied sciences over 63 weeks. The events were examined as text and as practices. LDA topic modeling was applied as the artificial intelligence analyzing the text data, and it identified 17 practices. In addition to the practices, the week-level examination drew on data obtained from sickness-absence and planning systems. We present a method and principles for analyzing weekly rhythms that can be applied to many similar data sources. The development of practices and well-being at work can now be carried out in a new way. If the monitoring of weekly rhythms becomes automatic in the future, the follow-up of development results can also be continuous.
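
    The sketch below shows roughly how an LDA topic model with 17 topics could be fitted to log-event text with scikit-learn; it is an assumed illustration, not the project's actual pipeline, and the example event strings are invented.

        # Assumed illustration of the LDA topic-modelling step: the log events are
        # treated as one whitespace-separated "document" per observation unit.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        event_docs = [
            "course_viewed quiz_attempt_started quiz_attempt_submitted",
            "assignment_submitted course_viewed forum_post_created",
            # ... one event string per observation unit (e.g. per user-week)
        ]

        vectorizer = CountVectorizer()
        counts = vectorizer.fit_transform(event_docs)

        lda = LatentDirichletAllocation(n_components=17, random_state=0)  # 17 practices
        doc_topics = lda.fit_transform(counts)         # per-document topic weights

        terms = vectorizer.get_feature_names_out()
        for k, weights in enumerate(lda.components_):  # most characteristic events per topic
            top = [terms[i] for i in weights.argsort()[-3:][::-1]]
            print(f"topic {k}: {top}")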

    Identification of sample-specific regulations using integrative network level analysis

    Background: Histologically similar tumors, even from the same anatomical position, may still show high variability at the molecular level, hindering analysis of genome-wide data. Shifting the analysis to the level of a gene regulatory network instead of focusing on single genes has been suggested as a way to overcome this heterogeneity, although the majority of network methods require large datasets. Network methods that are able to function at the single-sample level are needed to overcome both the heterogeneity and sample-size issues. Methods: We present a novel network method, Differentially Expressed Regulation Analysis (DERA), that integrates expression data with biological network information at the single-sample level. The sample-specific networks are subsequently used to discover samples with similar molecular functions by identifying regulations that are shared between samples or are specific to a subgroup. Results: We applied DERA to identify key regulations in triple-negative breast cancer (TNBC), which is characterized by the lack of estrogen receptor, progesterone receptor, and HER2 expression and has a poorer prognosis than the other breast cancer subtypes. DERA identified 110 core regulations, consisting of 28 disconnected subnetworks, for TNBC. These subnetworks are related to oncogenic activity, proliferation, cancer survival, invasiveness, and metastasis. Our analysis further revealed 31 regulations specific to TNBC as compared to the other breast cancer subtypes, which thus form a basis for understanding TNBC. We also applied DERA to high-grade serous ovarian cancer (HGS-OvCa) data and identified several regulations shared between HGS-OvCa and TNBC. The performance of DERA was compared to two pathway analysis methods, GSEA and SPIA, and our results show better reproducibility and higher sensitivity in a small sample set. Conclusions: We present a novel method, DERA, to identify subnetworks that are similarly active for a group of samples. DERA was applied to breast and ovarian cancer data, showing that our method is able to identify reliable and potentially important regulations with high reproducibility. The R package is available at http://csbi.ltdk.helsinki.fi/pub/czliu/DERA/.
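
    As a rough illustration of the idea of sample-specific regulations (and not the actual DERA algorithm), the sketch below overlays one sample's expression changes on a prior regulatory network and keeps the regulator-target edges in which both genes change; the gene pairs, fold changes, and threshold are hypothetical.

        # Illustration only, not DERA: keep the prior-network edges whose regulator and
        # target are both differentially expressed in a single sample.
        network = [("TP53", "CDKN1A"), ("MYC", "CCND1"), ("ESR1", "PGR")]  # (regulator, target)

        # Per-sample log2 fold changes relative to a reference (toy numbers).
        sample_lfc = {"TP53": -1.8, "CDKN1A": -1.2, "MYC": 2.1,
                      "CCND1": 1.9, "ESR1": 0.1, "PGR": 0.2}

        def active_regulations(edges, lfc, threshold=1.0):
            """Return the sample-specific subnetwork of edges with both genes changed."""
            return [(r, t) for r, t in edges
                    if abs(lfc.get(r, 0.0)) >= threshold and abs(lfc.get(t, 0.0)) >= threshold]

        print(active_regulations(network, sample_lfc))
        # [('TP53', 'CDKN1A'), ('MYC', 'CCND1')]: comparing such per-sample sets reveals
        # regulations shared across samples or specific to a subgroup.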

    Data integration to prioritize drugs using genomics and curated data

    Background: Genomic alterations affecting drug target proteins occur in several tumor types and are prime candidates for patient-specific tailored treatments. Increasingly, patients likely to benefit from targeted cancer therapy are selected based on molecular alterations. The selection of a precision therapy benefiting most patients is challenging but can be enhanced by integrating multiple types of molecular data. Data integration approaches for drug prioritization have successfully integrated diverse molecular data but do not take full advantage of existing data and literature. Results: We have built a knowledge base that connects data from public databases with molecular results from over 2200 tumors, signaling pathways, and drug-target databases. Moreover, we have developed a data mining algorithm to effectively utilize this heterogeneous knowledge base. Our algorithm is designed to facilitate retargeting of existing drugs by stratifying samples and prioritizing drug targets. We analyzed 797 primary tumors from The Cancer Genome Atlas breast and ovarian cancer cohorts using our framework. FGFR, CDK, and HER2 inhibitors were prioritized in the breast and ovarian data sets. Estrogen receptor-positive breast tumors showed potential sensitivity to targeted inhibitors of FGFR due to activation of FGFR3. Conclusions: Our results suggest that computational sample stratification selects potentially sensitive samples for targeted therapies and can aid in precision-medicine drug repositioning. Source code is available from http://csblcanges.fimm.fi/GOPredict/.
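
    The sketch below illustrates only the general idea of cohort-level drug prioritization, not GOPredict's published scoring: drugs are ranked by the fraction of samples carrying an activating alteration in at least one target gene. The drug-target sets and the toy cohort are invented.

        # Illustration only, not GOPredict: rank drugs by the fraction of samples with
        # an activating alteration in at least one of the drug's target genes.
        drug_targets = {
            "FGFR inhibitor": {"FGFR1", "FGFR2", "FGFR3"},
            "CDK inhibitor": {"CDK4", "CDK6"},
            "HER2 inhibitor": {"ERBB2"},
        }

        # Per-sample sets of genes with activating alterations (toy cohort).
        cohort = [{"FGFR3", "TP53"}, {"ERBB2"}, {"FGFR1", "CDK4"}, {"CDK6"}]

        def prioritize(targets_by_drug, samples):
            """Return (drug, fraction of samples with an altered target), best first."""
            scores = {drug: sum(1 for altered in samples if altered & targets) / len(samples)
                      for drug, targets in targets_by_drug.items()}
            return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

        for drug, fraction in prioritize(drug_targets, cohort):
            print(f"{drug}: altered target in {fraction:.0%} of samples")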

    Let-7 microRNA controls invasion-promoting lysosomal changes via the oncogenic transcription factor myeloid zinc finger-1

    Cancer cells utilize lysosomes for invasion and metastasis. Myeloid Zinc Finger 1 (MZF1) is an ErbB2-responsive transcription factor that promotes invasion of breast cancer cells via upregulation of lysosomal cathepsins B and L. Here we identify the let-7 microRNA, a well-known tumor suppressor in breast cancer, as a direct negative regulator of MZF1. Analysis of primary breast cancer tissues reveals a gradual upregulation of MZF1 from normal breast epithelium to invasive ductal carcinoma and a negative correlation between several let-7 family members and MZF1 mRNA, suggesting that the inverse regulatory relationship between let-7 and MZF1 may play a role in the development of invasive breast cancer. Furthermore, we show that MZF1 regulates lysosome trafficking in ErbB2-positive breast cancer cells. In line with this, MZF1 depletion or let-7 expression inhibits the invasion-promoting anterograde trafficking of lysosomes and the invasion of ErbB2-expressing MCF7 spheres. The results presented here link MZF1 and let-7 to lysosomal processes in ErbB2-positive breast cancer cells that, in non-cancerous cells, have primarily been connected to the transcription factor EB. Identifying MZF1 and let-7 as regulators of lysosome distribution in invasive breast cancer cells uncouples cancer-associated, invasion-promoting lysosomal alterations from normal lysosomal functions and thus opens up new possibilities for the therapeutic targeting of cancer lysosomes.